# Multimodal generation

Omnigen2
Apache-2.0
OmniGen2 is a unified multimodal model composed of a 3B vision-language model and a 4B diffusion model. It supports visual understanding, text-to-image generation, instruction-guided image editing, and in-context generation.
Text-to-Image
OmniGen2
136
5
Fashion BLIP
Apache-2.0
BLIP is a Transformer-based image-to-text generation model that can generate natural language descriptions for input images.
Image-to-Text
kzap201
585
0
Skyreels V2 DF 1.3B 540P
Other
SkyReels V2 is the first open-source video generation model to adopt an autoregressive diffusion-forcing architecture. It supports unlimited-length movie generation and achieves state-of-the-art performance among publicly available models.
Video Processing Safetensors
Skywork
600
23
SIMS 7B
MIT
A speech-language model that extends Qwen2.5-7B, supporting interleaved speech-text training and cross-modal generation.
Text-to-Audio Transformers English
slprl
51
1
Qwen2 Vl Instuct Bpmncoder
Apache-2.0
A 4-bit quantized model based on Qwen2-VL-7B, trained with Unsloth and the Hugging Face TRL library, achieving a 2x inference speedup.
Text-to-Image Transformers English
utkarshkingh
18
1
Swin Distilbertimbau
MIT
Brazilian Portuguese image captioning model based on Swin Transformer and DistilBERTimbau
Image-to-Text Transformers Other
laicsiifes
18
3
Show O W Clip Vit
MIT
Show-o is a PyTorch-based any-to-any conversion model focused on multimodal task processing.
Text-to-Image
showlab
18
2
Show O
MIT
Show-o is an any-to-any conversion model based on PyTorch, supporting input and output conversion across multiple modalities.
Text-to-Video
showlab
225
16
MGM 7B
MGM-7B is an open-source multimodal chatbot built on Vicuna-7B-v1.5, supporting high-definition image understanding, reasoning, and generation.
Text-to-Image Transformers
YanweiLi
975
8
Hkjk
MIT
A text-to-video generation model based on the AllenNLP library that generates video content from an input text description.
Text-to-Video
MileAway
0
0
Chexagent 8b
CheXagent is a foundation model for chest X-ray interpretation that can assist medical professionals in interpreting chest X-ray images.
Image-to-Text Transformers
StanfordAIMI
1,020
40
Vit Roberta Fa Image Captioning Flickr30k
A Persian image captioning model based on ViT+RoBERTa architecture, specifically designed to generate Persian text descriptions from images
Image-to-Text Other
hezarai
85
1
Blip Base Captioning Ft Hl Narratives
Apache-2.0
A BLIP model fine-tuned on the HL Narratives dataset to generate high-level narrative image descriptions.
Image-to-Text Transformers English
michelecafagna26
61
1
Blip Base Captioning Ft Hl Scenes
Apache-2.0
This model is an image captioning model based on the BLIP architecture, specifically fine-tuned for high-level scene descriptions.
Image-to-Text Transformers English
michelecafagna26
13
0
Polsk
This model can convert text descriptions into video content and is suitable for various creative and automated scenarios.
Text-to-Video
Tyffuss86
0
0
Thinksites
MIT
A text-to-video model that converts an input text description into video content.
Text-to-Video
thinkamconnect
0
0
Text2video Zero Controlnet Canny Arcane
Openrail
Text2Video-Zero is a zero-shot text-to-video tool; this variant uses ControlNet with Canny edge guidance and the Arcane style.
Text-to-Video
PAIR
39
31
Vit Rugpt2 Image Captioning
This is an image captioning model trained on a translated version (English-Russian) of the COCO2014 dataset, capable of generating Russian descriptions for input images.
Image-to-Text Transformers Other
tuman
111
13
Igpt Fr Cased Base
Apache-2.0
An incrementally pre-trained French language model based on GPT-fr, with text-to-image generation capabilities.
Text-to-Image Transformers French
asi
64
5
Molt5 Large
Apache-2.0
MolT5 is a large language model for translation between molecules and natural language, built on the T5 architecture.
Molecular Model Transformers
laituan245
196
1
© 2025 AIbase